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Extending  Lawson’s  Algorithm  to  Include 
the  Huber  M-Estimator 


Iain  J.  Anderson,  John  C.  Mason, 
and  Colin  Ross 


Abstract.  When  fitting  a curve  to  experimental  data,  there  is  no 
guarantee  that  the  data  obtained  are  as  accurate  as  might  be  expected. 
The  effect  of  outside  influences  may  cause  the  data  set  to  contain  outliers. 
These  outliers  can  have  a significant  effect  on  any  curve  which  is  fitted 
to  such  data.  The  foo-norm,  which  is  particularly  appropriate  for  fitting 
data  with  uniformly  distributed  errors,  is  extremely  sensitive  to  such  out- 
liers, since  it  minimises  the  maximum  error  from  the  data  to  the  curve. 
Therefore,  a technique  which  approximates  a data  set  using  the  foo-norm, 
without  being  adversely  affected  by  outliers,  would  be  a useful  addition  to 
the  array  of  tools  available.  We  present  numerical  examples  to  illustrate 
the  use  of  such  a technique  and  also  some  practical  applications  to  justify 
its  use. 


§1.  Introduction 

It  is  widely  accepted  that  the  ^oo-norm  is  the  most  appropriate  measure  of 
the  error  when  approximating  data  which  are  very  accurate  or  have  errors 
sampled  from  a uniform  distribution.  Unfortunately,  because  the  norm  is 
extremely  sensitive  to  outliers,  it  is  not  suitable  for  use  in  fitting  experimental 
data  containing  such  points.  Nevertheless,  it  may  be  the  case  that  the  tx- 
norm  is  the  most  appropriate  error  measure  for  the  non-outlying  data,  and  so 
we  present  an  algorithm  for  finding  an  fit  to  the  non-outliers  of  a data  set. 

The  algorithm  itself  is  based  on  a combination  of  the  Huber  M-estimator 
[6]  and  Lawson’s  algorithm  [7].  There  is  considerable  literature  on  both  tech- 
niques as  separate  subjects,  and  we  mention  here  only  a selection.  Lawson’s 
algorithm  was  first  analysed  by  Lawson  [7]  in  1961,  and  was  later  studied 
by  Rice  and  Usow  [11],  Cline  [2]  and  Ellacott  [4],  Similarly,  the  Huber  M- 
estimator  was  developed  by  Huber  [6]  in  1964  and  has  received  considerable 
attention  in  the  form  of  algorithms  for  its  solution  as  well  as  analyses  of  its 
behaviour.  Papers  by  Clark  and  Osborne  [1],  Ekblom  [3],  Madsen  and  Nielsen 
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[9],  Michelot  and  Bougeard  [10]  and  Li  [8]  all  look  at  the  Huber  M-estimator 
either  in  its  own  right  or  as  one  of  a class  of  robust  estimators. 

In  this  paper,  we  consider  the  problem  of  fitting  a function  of  the  linear 
form  f(x)  = YTj- i c]4>j{x)  to  a set  of  data  {(®i,  ?/i)}^=i>  where  the  {4>j}  are  a 
set  of  basis  functions.  To  this  end,  we  minimise  the  residuals  r,-  = y,  — f(xi). 
What  our  algorithm  achieves  in  practice  is  to  obtain  an  fit  for  those  r,  such 
that  |r^  | is  less  than  the  Huber  parameter  7,  say,  and  effectively  to  ignore  the 
remaining  data. 

The  circumstances  that  require  such  an  algorithm  occur  in  practice,  par- 
ticularly in  metrology  where  extremely  accurate  readings  can  be  obtained  (by, 
for  example,  a CD  reader)  but  are  subject  to  the  occasional  outlier  (due,  for 
example,  to  optical  effects).  These  outliers  usually  only  appear  in  groups  of 
one  or  two,  so  they  are  isolated,  which  leads  to  an  easier  problem  than  if  they 
appeared  in  larger  groups.  Another  metrological  situation  where  this  algo- 
rithm can  be  applied  is  in  the  measurement  of  a cylinder  in  an  automotive 
engine  where  there  is  approximately  95%  very  accurate  data,  and  5%  outliers. 
Naturally,  these  problems  might  require  a slightly  different  fitting  technique, 
but  this  algorithm  is  a useful  starting  point  from  which  more  general  fitting 
procedures  may  be  developed  in  future  work. 


§2.  Background 

In  this  section,  we  discuss  some  aspects  of  both  Huber  estimation  and  Lawson’s 
algorithm.  In  the  next  section  we  describe  how  to  combine  the  two  techniques 
to  create  a new  algorithm  which  satisfies  our  requirements. 


The  Huber  M-estimator 


The  Huber  M-estimator  is  based  on  the  Huber  function 


rt2/2,  if|<|<i, 

l|f|-l/2,  if  ]t|  > 1, 


(1) 


introduced  by  Huber  [6]  in  1964,  and  is  defined  in  the  following  straightforward 
way: 

m 

E = J2p{ri/'y),  (2) 

«=1 

where  rt  is  the  residual  in  the  ith  datum,  and  7 is  the  Huber  threshold  defining 
the  distinction  between  “accurate”  and  “inaccurate”  data. 

There  are  several  algorithms  to  solve  the  problem  of  minimising  (2)  with 
respect  to  c,  several  of  which  are  described  by  Li  [8],  However  in  this  paper, 
we  limit  ourselves  to  the  Newton  method.  This  involves  solving  [8] 


^ATDAp  = -Atv 
7 7 


at  each  iteration,  where  A is  the  design  matrix  with  entries  Atj  = <j>j{xi),  D 
is  a diagonal  matrix  with  entries  Da  = 1 if  |r,|  < 7 and  Da  = 0 if  |rt|  > 7, 
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and  v has  entries  Vi  = p'(ri/j).  Solving  this  system  gives  an  update  vector 
p which  should  provide  a better  estimate  c + p of  the  solution  parameters 
c*.  In  order  to  ensure  convergence,  we  also  incorporate  a line  search  which 
involves  finding  a scalar  a which  is  the  solution  to  the  equation 


(Apfp1 


+ aAp  'j 


= 0. 


Having  found  a,  we  then  obtain  a new  estimate  of  c*  by  setting  c :=  c + ap. 
We  repeat  this  procedure,  updating  D and  v as  necessary,  until  we  have 
obtained  c*  to  sufficient  accuracy. 


Weighting 

We  choose  to  generalise  (2)  by  introducing  weights  to  obtain  a weighted  Huber 
M-estimator  of  the  form  F = YaL i VJip{Tih)>  where  7 is  the  Huber  threshold, 
Wi  is  the  weight  associated  with  the  ith  datum,  and  r,  is  the  residual  associated 
with  the  it.h  datum.  It  may  be  necessary  to  introduce  weights  in  this  way  in 
order  to  deal  with  non-identically  distributed  errors  in  the  data,  in  which  case 
the  weights  may  be  chosen  to  be  the  reciprocals  of  the  standard  deviations  of 
the  underlying  probability  distributions. 

Many  algorithms  exist  to  find  unweighted  Huber  fits,  and  in  general, 
adapting  them  to  find  a weighted  Huber  fit  is  a straightforward  task.  As  an 
example,  we  show  how  to  adapt  a Newton-like  method. 

Weighted  Huber  algorithm. 

1)  Calculate  vt  = u>ip'(ri/7). 

2)  If  ^ATDwAp  = —ATv  is  consistent,  define  p :=  —^(ATDwA)+ATv, 

Otherwise,  define  p :=  — ~jP~1AT\,  where  P is  a positive  definite 
matrix. 

3)  Find  a steplength  a > 0 such  that  (Ap)T Dw p’ ((r  + aAp)/7)  = 0. 

4)  Set  c :=  c + ap. 

Here,  A is  the  m x n matrix  representing  the  underlying  linear  model,  Dw  is 
a diagonal  matrix  with  entries 


f Wi,  if  |rj/7|  < 1, 

to,  tfh/7|>i. 


P is  usually  the  identity  matrix,  I and  Y+  denotes  the  pseudo-inverse  of  a 
matrix  Y,  defined  so  that  Y+  is  that  matrix  X of  the  same  dimensions  as  YT 
such  that  YXY  = Y,  XYX  = X and  YX  and  XY  are  symmetric. 

We  note  here  that  there  are  many  other  algorithms  for  finding  a Huber 
fit,  and  that  most,  if  not  all,  can  be  adapted  just  as  easily. 
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Lawson’s  algorithm 

This  algorithm,  analysed  by  Lawson  [7]  in  1961,  enables  an  fit  to  be  ob- 
tained by  repeated  weighted  1 2 fits.  The  algorithm  itself  is  very  straightfor- 
ward, and  involves  updating  the  weights  at  each  iteration  according  to 


w. 


O+i) 


WiG(rf}) 
Er=i  WkG(r<»)’ 


(3) 


where  G(t)  = \t\.  The  denominator  is  a normalisation  term  to  ensure  that  the 
weights  sum  to  unity.  The  numerator  has  the  effect  of  weighting  data  with 
large  residuals  more  heavily,  with  the  result  that,  in  the  limit,  only  those  data 
with  a maximal  residual  will  have  any  weight  attached  to  them. 

Lawson’s  algorithm  finds  the  points  of  extreme  oscillation  and  weights 
these  accordingly  to  obtain  the  best  £<*,  approximation.  The  other  weights 
are  not  important,  and  in  fact  converge  to  zero. 

Initial  values  for  the  weights  are  usually  chosen  to  be  = 1/m,  as 
this  treats  all  the  data  equally  and  satisfies  the  condition  that  the  sum  of  the 
weights  must  be  unity.  Proofs  of  convergence  require  that  the  {<f>i(x)}  form 
a Chebyshev  set,  but  experimental  results  (see,  for  example,  [4])  suggest  that 
the  algorithm  is  more  generally  applicable. 

A summary  of  Lawson’s  algorithm. 

1)  Set  all  weights  equal  (with  the  sum  of  weights  equal  to  unity). 

2)  Perform  a weighted  least-squares  fit. 

3)  Calculate  the  residuals  from  the  weighted  least-squares  fit. 

4)  Update  the  weights  according  to  Lawson’s  formula  (3). 

5)  Return  to  Step  2 until  convergence  is  obtained. 


§3.  The  Algorithm 

We  are  concerned  with  the  solution  of  the  problem 


min  max  IrJ, 

c {ri:|ri|<7} 


where  r;  is  the  residual  for  the  datum  ( ),  and  7 is  the  Huber  threshold 
value.  In  order  to  solve  this  problem,  we  reformulate  it  as 

m 

i=l 

where  p is  defined  as  in  equation  (1),  and  we  adopt  an  iterative  procedure  to 
find  c by  performing  successive  weighted  Huber  fits.  The  weights  are  updated 
after  each  iteration  in  a manner  similar  to  Lawson’s  original  algorithm.  While 
Lawson’s  algorithm  is  concerned  with  finding  a minimax  fit  via  a sequence 
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of  weighted  least-squares  fits,  this  new  algorithm  finds  a minimax  fit  to  the 
non-outlying  data  via  a sequence  of  weighted  Huber  fits. 

Unfortunately,  Lawson’s  rule  for  updating  the  weights  cannot  be  used  in 
this  new  algorithm  since  the  rule  would  weight  the  outliers  too  heavily.  As 
a result,  the  outliers  would  be  fitted  more  accurately  at  the  next  iteration. 
The  essential  point  of  Lawson’s  update  is  to  weight  those  datum  points  which 
correspond  to  the  maximal  errors  of  the  minimax  fit.  To  maintain  this  general 
trend,  we  need  an  update  in  which  the  function  G in  (3),  rather  than  being 
monotonic,  instead  increases  to  a peak  and  then  decays,  with  the  peak  corre- 
sponding to  the  residual  with  the  largest  magnitude  which  does  not  exceed  7. 
The  latter  is  termed  the  “7-maximal  residual”  and  denoted  by  7 mr- 

The  function  we  choose  in  place  of  |f  | is  a negative  exponential  of  the 
form 

f 1*1,  if  1*1  < 7 MR, 

\ 7 if  \t\  > imr, 

and  we  update  the  weights  at  each  iteration  according  to  (3).  (Note  that  G ® 
changes  with  the  iteration  l.) 

For  |t|  > 7mr,  the  7 mr  factor  in  is  needed  to  ensure  continuity 

at  jr\p')|  = 7 Mr  and  the  —7^  term  in  the  exponential  is  used  so  that  the 
left  and  right  derivatives  of  Gi(t)  are  continuous  at  7 mr-  The  reason  for  this 
second  condition  is  to  ensure  that  points  with  residuals  just  over  7 mr  and 
those  with  residuals  just  less  than  7 mr  are  treated  similarly. 

§4.  Convergence 

We  have  obtained  favourable  results  with  this  algorithm,  provided  that  certain 
conditions  are  met.  Firstly,  the  form  of  the  approximating  model  needs  to  be 
appropriate.  For  example,  trying  to  approximate  a set  of  data  corresponding 
to  a quadratic  by  a straight  line  will  probably  lead  to  problems,  as  it  is  likely 
that  a considerable  number  of  the  data  will  be  treated  as  outliers.  Secondly, 
7 needs  to  be  chosen  carefully.  If  7 is  chosen  to  be  too  small,  then  there  may 
be  many  solutions  and  it  may  not  be  possible  to  predict  to  which  solution  the 
algorithm  will  converge  — if  it  converges  at  all. 

We  therefore  conclude  that  in  order  to  use  this  algorithm  effectively,  we 
first  need  to  have  some  details  of  the  problem  we  are  to  tackle.  If  we  are 
unsure  as  to  what  sort  of  model  to  fit  to  the  data,  then  7 should  be  chosen  to 
be  larger  than  we  might  initially  require.  If  we  are  unsure  what  value  of  7 to 
choose,  then  some  sort  of  7-reduction  procedure  may  be  effective  for  finding  an 
appropriate  value.  An  initial  value  of  7 may  be  chosen  by  use  of  the  formula 
7 = 1.9906  x median  (jr;  — median(r,)|)  (see,  for  example,  Ross  et  al,  [12]). 

The  effect  of  using  a Lawson-like  update  with  a non-monotonic  factor  is  to 
increase  the  weights  at  the  extrema  of  the  minimax  approximation  and  reduce 
all  other  weights,  including  those  of  the  outliers.  In  practice,  the  algorithm 
produces  a minimax  approximation  to  a subset  of  the  data  with  the  aim  that 
this  subset  should  be  the  non-outlying  data.  Unfortunately,  we  have  been 
unable  thus  far  to  prove  convergence  for  this  algorithm.  However,  it  should 
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be  noted  that  the  convergence  rate  would  be  expected  to  be  similar  to  that  of 
Lawson’s  original  algorithm  as  they  essentially  do  the  same  task. 


§5.  Acceleration  Schemes 

Although  the  algorithm  as  it  stands  is  acceptable  for  small  problems,  it  nev- 
ertheless takes  a considerable  length  of  time  to  achieve  relaxed  convergence 
conditions.  This  is  no  surprise  as  one  of  the  shortcomings  of  Lawson’s  al- 
gorithm is  its  slowness  to  converge.  More  specifically,  the  convergence  of 
Lawson’s  algorithm  is  linear  with  a ratio  of  t*  [11],  where 


r*  = max 


max  [r*  | 


: r < 1 


with  r*  defined  to  be  the  vector  of  residuals  from  the  optimal  £x  fit.  In 
many  situations,  this  ratio  can  be  very  close  to  one,  leading  to  rather  slow 
convergence. 

One  technique  to  increase  the  rate  of  convergence  is  to  use  the  fact 
that,  upon  convergence,  the  weights  corresponding  to  non-extremal  residu- 
als are  zero.  Specifically,  after  a set  number  of  normal  iterations  to  allow 
the  weights  to  settle  a little,  we  may  set  w;  = 0 if  |r,- 1 < rVIMloo,  where 

a = • This  latter  technique  is  the  one  presented  by  Rice 

and  Usow  [10],  although  Ellacott  [4]  found  that  it  could  cause  the  algorithm 
to  fail. 

Of  course  in  the  case  of  this  new  algorithm,  these  schemes  cannot  be 
applied  directly.  We  need  to  compensate  for  those  data  which  are  being  treated 
as  outliers,  thus  this  scheme  is  not  valid.  If  it  were  possible  to  find  some 
analogue  of  a for  this  new  algorithm,  then  it  may  be  possible  to  use  that 
analogue  in  an  acceleration  scheme. 


§6.  Numerical  Results 

We  have  tested  this  algorithm  extensively  and  now  present  some  numerical 
results  to  illustrate  it.  In  Figure  1,  we  show  a synthesised  data  set  consisting 
of  95  points  lying  close  to  the  polynomial  f(x)  = 2x2  — 3x  + 1 with  5 outliers. 
Figure  1 also  shows  the  best  fitting  quadratic  polynomial  to  the  data  obtained 
by  a least-squares  fit,  by  an  £ x fit  and  by  the  new  algorithm  presented  in  this 
paper.  The  noise  in  the  data  is  taken  from  a uniform  distribution  on  [—0.1,  0.1] 
and  we  thus  choose  7 = 0.1.  Table  1 shows  the  results  from  the  various  fits 
performed.  It  is  clear  that  both  the  1 2 and  i0 0 fits  are  unsuitable  and  are 
affected  by  the  outliers.  However,  the  new  algorithm  succeeds  in  identifying 
the  outliers  and  successfully  ignoring  them.  Comparing  the  results  from  the 
new  algorithm  with  those  from  performing  least-squares  and  minimax  fits  to 
the  data  without  outliers,  we  see  that  they  are  much  more  in  agreement.  In 
fact,  as  we  would  expect,  the  new  algorithm  has  generated  an  almost  identical 
fit  to  the  t?oo  fit  on  the  accurate  data. 
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Fig.  1.  Various  quadratic  polynomial  fits  to  a set  of  data  with  outliers. 

We  also  note  that,  while  the  new  algorithm  seems  to  be  significantly 
faster  in  this  example,  this  is  not  the  case  in  general.  In  fact  with  stricter 
convergence  criteria,  Lawson’s  original  algorithm  applied  to  the  non-outlying 
data  converges  in  fewer  iterations  than  the  new  algorithm.  The  reason  that 
the  minimax  fit  to  the  data  containing  outliers  takes  fewer  iterations  is  due  to 
the  result  in  Section  5 involving  r*,  which,  because  of  the  outliers,  is  actually 
quite  small  (r*  = 0.9497)  compared  to  r*  = 0.9894  for  the  case  of  the  new 
algorithm. 


t2 

New 

e2  (no) 

too  (NO) 

Co 

+0.9499 

+0.9839 

+0.9958 

+1.0059 

+0.9971 

Cl 

-2.7868 

-0.4571 

-2.9951 

-3.0007 

-3.0001 

C2 

+2.1568 

+2.0492 

+2.0057 

+2.0050 

+2.0035 

Iterations 

1 

46 

39 

1 

140 

Tab.  1.  Numerical  results:  fitting  a quadratic  (NO  : No  outliers). 

The  convergence  criterion  was  the  same  for  both  Lawson-like  algorithms, 
namely  that  the  magnitude  of  the  four  largest  7-maximal  residuals  should 
agree  with  a relative  error  of  less  than  10-2.  In  addition,  no  acceleration 
schemes  were  used  since  we  needed  to  obtain  a measure  of  how  fast  the  algo- 
rithms were  in  their  unaccelerated  form. 


§7.  Conclusions 

We  have  presented  an  algorithm  for  fitting  a linear  form  to  data  containing  uni- 
form noise,  contaminated  by  outliers.  Future  work  will  concentrate  on  three 
main  areas.  Firstly,  acceleration  of  the  convergence  of  the  algorithm.  Sec- 
ondly, extension  to  non-linear  forms.  Thirdly,  extension  to  general  £p  norms 
rather  than  solely  to  the  ^oo-norm. 


8 


I.  J.  Anderson,  J.  C.  Mason,  and  C.  Ross 


References 

1.  Clark,  D.  I.,  and  M.  R.  Osborne,  Finite  algorithms  for  Huber’s  M - 
estimator,  SIAM  Journal  on  Scientific  and  Statistical  Computing  7(1) 
(1986),  72-85. 

2.  Cline,  A.  K.,  Rate  of  convergence  of  Lawson’s  algorithm,  Mathematics  of 
Computation  26(117)  (1972),  167-176. 

3.  Ekblom,  H.,  A new  algorithm  for  the  Huber  estimator  in  linear  models, 
BIT  28  (1988),  123-132. 

4.  Ellacott,  S.  W.,  Line ar  Chebyschef  approximation,  M.Sc.  thesis,  Univer- 
sity of  Manchester,  UK,  1972. 

5.  Hermey,  D.,  and  G.  A.  Watson,  Fitting  data  with  errors  in  all  variables 
using  the  Huber  M-estimator,  SIAM  Journal  on  Scientific  Computing 
20(4)  (1999),  1276-1298. 

6.  Huber,  P.  J.,  Robust  estimation  of  a location  parameter,  Annals  of  Math- 
ematical Statistics  35  (1964),  73-101. 

7.  Lawson,  C.  L.,  Contributions  to  the  theory  of  linear  least  maximum  ap- 
proximation, Ph.D.  thesis,  University  of  California,  Los  Angeles,  CA, 
USA,  1961. 

8.  Li,  W.,  Numerical  algorithms  for  the  Huber  M-estimator  problem.  In 
Approximation  Theory  VIII,  C.  K.  Chui  and  L.  L.  Schumaker  (eds.), 
World  Scientific,  New  York,  NY,  USA,  1995,  325-334. 

9.  Madsen,  K.,  and  H.  B.  Neilsen,  Finite  algorithms  for  robust  linear  regres- 
sion, BIT  30  (1990),  682-699. 

10.  Michelot,  M.  L.,  and  M.  L.  Bougeard,  Duality  results  and  proximal  so- 
lutions of  the  Huber  M-estimator  problem,  Applied  Mathematics  and 
Optimization  30  (1994),  203-221. 

11.  Rice,  J.  R.,  and  K.  H.  Usow,  The  Lawson  algorithm  and  extensions,  Math- 
ematics of  Computation  22  (1968),  118-127. 

12.  Ross,  C.,  I.  J.  Anderson,  J.  C.  Mason,  and  D.  A.  Turner,  Approximating 
coordinate  data  that  has  outliers,  in  Advanced  Mathematical  and  Com- 
putational Tools  in  Metrology  IV,  P.  Ciarlini,  A.  B.  Forbes,  F.  Pavese, 
and  D.  Richter  (eds.),  World  Scientific,  Singapore,  2000,  210-219. 

I.  J.  Anderson,  J.  C.  Mason,  and  C.  Ross 
School  of  Computing  and  Mathematics 
University  of  Huddersfield 
Queensgate,  Huddersfield,  West  Yorkshire 
HD1  3DH,  UK 

i .  j . andersonQhud .ac.uk 

j .  c . masonShud .ac.uk 
c . rossQhud .ac.uk 


